
Extend histogram bucket cap for HTTP and catalog metadata update timers#567

Merged
cbb330 merged 1 commit into linkedin:main from abhisheknath2011:latency-metric on May 1, 2026

Conversation

@abhisheknath2011
Member

Summary

The p99 latency alert for PUT /v1/databases/{databaseId}/tables/{tableId} on tables-service was saturating at ~31s, regardless of how slow snapshot commits actually got. Investigation showed this is not a real timeout; it is a
Micrometer histogram artifact:

  • application.properties enables management.metrics.distribution.percentiles-histogram.all=true, so all timers publish histograms.
  • Micrometer's default Timer.maximumExpectedValue is 30s. The pre-computed PercentileHistogramBuckets table is truncated at this cap, with the highest finite bucket near ~30.5s (rounded to ~31s in dashboards).
  • Anything slower lands in the +Inf bucket and is invisible to histogram_quantile, so p99/max visually pin at 31s.
  • The existing override for catalog_metadata_retrieval_latency=600s proved this; the same pattern was missing for the HTTP server timer and the catalog update timer, both of which can legitimately exceed 30s during metastore contention.
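
For context, that existing override in application.properties looks roughly like this (sketched from the bullet above, assuming the standard Spring Boot management.metrics.distribution prefix; the repo's exact line may differ):

```properties
# Already present: raises the retrieval timer's histogram cap well past the 30s default.
management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency=600s
```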

Max latency pins at ~31s even though the put-snapshot path takes longer:

[dashboard screenshot: p99/max latency for the PUT endpoint flat-lining at ~31s]

Changes

  • Override Micrometer's default 30s maximumExpectedValue for http.server.requests and catalog_metadata_update_latency in tables-service, raising them to 600s.
  • Add tests covering both the property wiring and the resulting histogram bucket boundaries.

services/tables/src/main/resources/application.properties

  • Add maximum-expected-value.catalog_metadata_update_latency=600s
  • Add maximum-expected-value.http.server.requests=600s (Spring's auto-instrumented HTTP timer; exported to Prometheus as http_server_requests_seconds)
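
Written out with the full Spring Boot prefix (the bullets above abbreviate the keys; this is a sketch assuming the same management.metrics.distribution.* namespace used elsewhere in this file):

```properties
# New overrides: extend finite histogram buckets to 600s so slow snapshot commits
# remain visible to histogram_quantile instead of collapsing into the +Inf bucket.
management.metrics.distribution.maximum-expected-value.catalog_metadata_update_latency=600s
management.metrics.distribution.maximum-expected-value.http.server.requests=600s
```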

services/tables/src/test/java/.../MetricsHistogramConfigurationTest.java

  • Add two property-level assertions (testMaxExpectedValueConfigurationIsSetForUpdateLatency, testMaxExpectedValueConfigurationIsSetForHttpServerRequests).
  • Add two live-registry assertions verifying histogram buckets actually extend to ≥600s for the new timers.
  • Factor shared bucket-assertion logic into assertTimerHistogramExtendsTo600s(...).
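
A minimal sketch of what the shared helper could look like, assuming a Spring-wired MeterRegistry with percentile histograms enabled and JUnit 5 assertions (illustrative, not the exact code in the PR):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

class TimerBucketAssertions {

    // Shared helper: verify that a timer's published histogram has a finite
    // bucket boundary at or beyond 600s, i.e. the 30s default cap was raised.
    static void assertTimerHistogramExtendsTo600s(MeterRegistry registry, String timerName) {
        Timer timer = registry.timer(timerName);
        timer.record(Duration.ofSeconds(1)); // record one sample so the snapshot is populated

        HistogramSnapshot snapshot = timer.takeSnapshot();
        double maxFiniteBucketSeconds = Arrays.stream(snapshot.histogramCounts())
                .mapToDouble(bucket -> bucket.bucket(TimeUnit.SECONDS))
                .filter(Double::isFinite) // ignore any +Inf boundary
                .max()
                .orElse(0.0);

        assertTrue(maxFiniteBucketSeconds >= 600.0,
                timerName + " should expose finite histogram buckets up to at least 600s");
    }
}
```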

[Issue] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

./gradlew :services:tables:test --tests "MetricsHistogramConfigurationTest" — all 9 tests pass (3 new + 6 existing).

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Collaborator

@cbb330 cbb330 left a comment


LGTM!

@cbb330 cbb330 merged commit 55a7c64 into linkedin:main May 1, 2026
1 check passed